Abstract: Data mining is the process of analyzing data in order to uncover patterns or trends. It comprises many
techniques, and related fields such as text mining and web mining also exist. Clustering is one of the most important
data mining and text mining techniques; it groups similar objects together. In other words, it organizes the given
objects into meaningful subgroups that make further analysis of the data easier. Clustered groups simplify search and
reduce the number of operations and the computational cost. Clustering methods are classified into data partitioning,
hierarchical clustering, and data grouping. The aim of this paper is to develop a new method for clustering text
documents, which are sparse and high-dimensional data objects. Like the k-means algorithm, the proposed algorithm is
fast and provides consistent, high-quality performance in clustering text documents. The proposed similarity measure
is based on multiple viewpoints.
I. Introduction

Clustering is the process of grouping a set of objects into classes of similar objects, and it is one of the most
interesting concepts in data mining; a cluster is defined as a collection of data objects that are similar to one
another. The purpose of clustering is to find the fundamental structures in data and organize them into meaningful
subgroups for further analysis. Many clustering algorithms, based on a variety of techniques and approaches, are
published every year. The k-means algorithm remains one of the top data mining algorithms in use today. Even so, it has
a few basic drawbacks, for example when clusters are of various sizes. Despite these drawbacks, its understandability,
simplicity, and scalability are the main reasons for its popularity.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by
optimizing a particular function of similarity among the data. Underlying this is an implicit assumption that the true
intrinsic structure of the data can be correctly described by the similarity formula defined and embedded in the
clustering criterion function. An algorithm with adequate performance and usability in most application scenarios can
be preferable to one with better performance in some cases but limited usage due to high complexity. While offering
reasonable results, k-means is fast and easy to combine with other methods in larger systems. The original k-means has
a sum-of-squared-error objective function that uses the Euclidean distance. In a very sparse and high-dimensional
domain such as text documents, spherical k-means, which uses cosine similarity (CS) instead of the Euclidean distance
as the measure, is deemed to be more suitable [1], [2]. The nature of the similarity measure plays a very important
role in the success or failure of a clustering method. Our objective is to derive a novel multi-viewpoint similarity
measure between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed
measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms,
which are fast and scalable like k-means.

II. Related work 

Document clustering is one of the important text mining techniques and has been around since the inception of the
text mining domain. It is the process of grouping objects into categories in such a way that intra-cluster similarity
and inter-cluster dissimilarity are both maximized. Here an object is a document, and a term is a word in a document.
Each document considered for clustering is represented as an m-dimensional vector d, where m is the total number of
terms in the document corpus. Document vectors are the result of a weighting scheme such as TF-IDF (Term
Frequency-Inverse Document Frequency). Many approaches to document clustering have been proposed, including
information-theoretic co-clustering [3], non-negative matrix factorization, and probabilistic model-based methods [4].
However, these approaches do not use a specific measure of document similarity. In this paper we consider methods that
do use an explicit measure. In the literature, one of the popular measures is the Euclidean distance:



    Dist(d_i, d_j) = \| d_i - d_j \|
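As an illustration of the vector representation and the Euclidean distance described above, the following minimal
sketch builds TF-IDF vectors for a toy corpus and measures the distance between two of them. The function names
`tfidf_vectors` and `euclidean_distance`, the whitespace tokenizer, the unsmoothed tf * log(n / df) weighting, and the
sample sentences are all assumptions made for the example; they are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one m-dimensional TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # m = number of distinct terms in the whole corpus.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Unsmoothed weighting (an assumption); terms absent from the document get 0.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def euclidean_distance(u, v):
    """Dist(d_i, d_j) = ||d_i - d_j||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


docs = [
    "data mining finds patterns",
    "text mining finds patterns in text",
    "clustering groups similar documents",
]
vocab, vecs = tfidf_vectors(docs)
print(len(vocab), euclidean_distance(vecs[0], vecs[1]))
```

Because the Euclidean distance depends on vector magnitudes, documents of different lengths can appear far apart even
when they share most of their terms, which is one motivation for direction-based measures on sparse text data.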

K-means is one of the most important clustering algorithms and is on the list of the top 10 data mining algorithms.
Owing to its simplicity and ease of use, it is still widely used in the data mining domain. The k-means algorithm uses
the Euclidean distance measure: its objective is to minimize the Euclidean distance between the objects in each cluster
and that cluster's centroid:

    \min \sum_{r=1}^{k} \sum_{d_i \in S_r} \| d_i - C_r \|^2

where C_r is the centroid (mean vector) of cluster S_r and k is the number of clusters.

In the text mining domain, cosine similarity is also a widely used measure of document similarity, especially for
high-dimensional and sparse document clustering. Cosine similarity is used in a variant of k-means known as spherical
k-means, which maximizes the cosine similarity between each cluster's centroid and the documents in that cluster. The
difference between k-means with the Euclidean distance and k-means with cosine similarity is that the former focuses on
vector magnitudes while the latter focuses on vector directions. Another popular approach is graph partitioning, in
which the document corpus is treated as a graph. The min-max cut algorithm follows this approach and minimizes a
centroid-based criterion function.

The CLUTO [1] software package implements a document clustering method based on graph partitioning. It first builds a
nearest-neighbor graph and then forms clusters from it. In this approach, for non-unit document vectors u_i and u_j,
the extended Jaccard coefficient is:

    Sim_{eJacc}(u_i, u_j) = \frac{u_i^t u_j}{\| u_i \|^2 + \| u_j \|^2 - u_i^t u_j}
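The extended Jaccard coefficient can be compared with cosine similarity in a few lines. The sketch below is only an
illustration: the helper names (`dot`, `normalize`, `cosine_similarity`, `extended_jaccard`) and the sample vectors are
assumptions, and the functions expect non-zero vectors. For unit vectors the denominator becomes 2 - u_i^t u_j, so the
coefficient is a monotonic function of the cosine, c / (2 - c).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


def cosine_similarity(u, v):
    """Depends only on the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def extended_jaccard(u, v):
    """u^t v / (||u||^2 + ||v||^2 - u^t v); sensitive to direction and magnitude."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice the magnitude
print(cosine_similarity(u, v))  # ignores the difference in length
print(extended_jaccard(u, v))   # lowered by the difference in length
```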



Compared with cosine similarity and the Euclidean distance, the extended Jaccard coefficient takes both direction and
magnitude into account. When the documents are represented as unit vectors, it behaves very much like cosine
similarity. The cosine, Euclidean, Jaccard, and Pearson correlation measures have been compared, with the conclusion
that the Euclidean and Jaccard measures are best for web document clustering. In [1], research was carried out on
categorical data: the authors selected related attributes for a given subject and calculated the distance between two
values. Document similarity can also be measured with concept-based and phrase-based approaches. In [1], a tree
similarity measure is used at the concept level, while a phrase-based approach has also been proposed. Both used the
Hierarchical Agglomerative Clustering algorithm to perform the clustering. Measures of structural similarity also exist
for XML documents [5]; however, they differ from those used in normal text document clustering.

III. Proposed work

The main goal of this work is to develop a novel multi-viewpoint based algorithm for document clustering that provides
maximum efficiency and performance. It focuses in particular on studying and exploiting the cluster overlapping
phenomenon to design cluster merging criteria. We concentrate on a new way of computing the overlap rate in order to
improve both time efficiency and veracity. Based on the hierarchical clustering method, we describe the use of the
Expectation-Maximization (EM) algorithm in a Gaussian Mixture Model to estimate the parameters and to merge the two
sub-clusters whose overlap is the largest. In the simplest case, an optimization problem consists of maximizing or
minimizing a real function by systematically choosing input values from within an allowed set and computing the value
of the function. The generalization of optimization theory and techniques to other formulations comprises a large area
of applied mathematics.

The cosine similarity can be expressed as follows:

    Sim(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0)    (5)

where 0 is the zero vector, which represents the origin. According to this formula, the measure takes 0 as its one and
only reference point.

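The similarity between two documents d_i and d_j in the same cluster S_r is defined as follows:

    sim(d_i, d_j) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} sim(d_i - d_h, d_j - d_h),    d_i, d_j \in S_r    (6)

where n is the total number of documents and n_r is the number of documents in S_r.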



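The multi-viewpoint based similarity (MVS) in Eq. (6) depends on the particular formulation of the individual
similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we
have:

    MVS(d_i, d_j | d_i, d_j \in S_r)
        = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h)
        = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \cos(d_i - d_h, d_j - d_h) \| d_i - d_h \| \| d_j - d_h \|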



The similarity between the two points di and dj inside cluster Sr, viewed from a point dh outside this cluster, is equal to the 
product of the cosine of the angle between di and dj looking from dh and the Euclidean distances from dh to these two 
points. 
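The multi-viewpoint similarity of a single pair can be sketched directly from this description. In the example below,
`mvs_pair` averages the dot products of the difference vectors over all viewpoints outside the cluster, and
`mvs_pair_cosine_form` computes the same quantity as cosine times the two Euclidean distances. Both function names and
the two-dimensional sample points are assumptions made for illustration, and the second form assumes that no viewpoint
coincides with d_i or d_j.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def mvs_pair(d_i, d_j, outside):
    """Average of (d_i - d_h)^t (d_j - d_h) over the viewpoints d_h outside the cluster."""
    total = sum(dot(sub(d_i, d_h), sub(d_j, d_h)) for d_h in outside)
    return total / len(outside)


def mvs_pair_cosine_form(d_i, d_j, outside):
    """Same value, written as cos(angle seen from d_h) * ||d_i - d_h|| * ||d_j - d_h||."""
    total = 0.0
    for d_h in outside:
        a, b = sub(d_i, d_h), sub(d_j, d_h)
        norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
        cos_angle = dot(a, b) / (norm_a * norm_b)
        total += cos_angle * norm_a * norm_b
    return total / len(outside)


d_i, d_j = [1.0, 0.0], [0.8, 0.6]       # two documents in the same cluster
outside = [[0.0, 1.0], [-1.0, 0.0]]     # viewpoints taken from other clusters
print(mvs_pair(d_i, d_j, outside), mvs_pair_cosine_form(d_i, d_j, outside))
```

Unlike cosine similarity, which always looks from the origin, the value here changes with the set of outside
viewpoints, so the same pair can be rated differently depending on the other clusters.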

We now carry out a validity test for the cosine similarity and the multi-viewpoint based similarity as follows. For
each type of similarity measure, a similarity matrix A = {a_ij}_{n×n} is created. For CS, this is simple, as
a_ij = d_i^t d_j. The procedure for building the MVS matrix is described in Procedure 1.



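procedure BUILDMVSMATRIX(A)

Step 1:  for r ← 1 : c do
Step 2:      D_{S\S_r} ← Σ_{d_i ∉ S_r} d_i
Step 3:      n_{S\S_r} ← |S \ S_r|
Step 4:  end for
Step 5:  for i ← 1 : n do
Step 6:      r ← class of d_i
Step 7:      for j ← 1 : n do
Step 8:          if d_j ∈ S_r then
Step 9:              a_ij ← d_i^t d_j - d_i^t D_{S\S_r} / n_{S\S_r} - d_j^t D_{S\S_r} / n_{S\S_r} + 1
Step 10:         else
Step 11:             a_ij ← d_i^t d_j - d_i^t (D_{S\S_r} - d_j) / (n_{S\S_r} - 1) - d_j^t (D_{S\S_r} - d_j) / (n_{S\S_r} - 1) + 1
Step 12:         end if
Step 13:     end for
Step 14: end for
Step 15: return A = {a_ij}_{n×n}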



Firstly, the outer composite with respect to each class is determined. Then, for each row ai of A, i = 1, . . . , n, if the pair of 
documents di and dj, j = 1, . . . , n are in the same class, aij is calculated as in line 9. Otherwise, dj is assumed to be in di's 
class, and aij is calculated as in line 11. 
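Procedure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than
the authors' implementation: the name `build_mvs_matrix`, the list-of-lists layout, and the sample data are
hypothetical; the document vectors are assumed to have unit length (which is where the constant + 1 comes from); and
every class is assumed to leave at least two documents outside it, so that no divisor becomes zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_mvs_matrix(docs, labels):
    """docs: unit-length vectors; labels: class of each document. Returns the n x n matrix A."""
    n, dim = len(docs), len(docs[0])
    # Steps 1-4: composite vector and count of the documents outside each class.
    outer_sum, outer_n = {}, {}
    for r in set(labels):
        outside = [d for d, lab in zip(docs, labels) if lab != r]
        outer_sum[r] = [sum(d[k] for d in outside) for k in range(dim)]
        outer_n[r] = len(outside)
    # Steps 5-14: fill A row by row.
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r = labels[i]
        for j in range(n):
            if labels[j] == r:
                D, m = outer_sum[r], outer_n[r]
            else:
                # d_j is treated as a member of d_i's class, so it is removed
                # from the outside composite and from the outside count.
                D = [a - b for a, b in zip(outer_sum[r], docs[j])]
                m = outer_n[r] - 1
            A[i][j] = dot(docs[i], docs[j]) - dot(docs[i], D) / m - dot(docs[j], D) / m + 1.0
    return A


docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]
labels = [0, 0, 1, 1, 1]
A = build_mvs_matrix(docs, labels)
print(A[0][1])
```

Precomputing the composite vectors means each entry costs a constant number of dot products instead of a pass over
all outside documents.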



After matrix A is formed, the code in Procedure 2 is used to get its validity score: 

procedure GETVALIDITY(validity, A, percentage)

Step 1:  for r ← 1 : c do
Step 2:      q_r ← floor(percentage × n_r)
Step 3:      if q_r = 0 then
Step 4:          q_r ← 1
Step 5:      end if
Step 6:  end for
Step 7:  for i ← 1 : n do
Step 8:      {a_iv[1], . . . , a_iv[n]} ← Sort {a_i1, . . . , a_in}
Step 9:          s.t. a_iv[1] ≥ a_iv[2] ≥ . . . ≥ a_iv[n], {v[1], . . . , v[n]} ← permute {1, . . . , n}
Step 10:     r ← class of d_i
Step 11:     validity(d_i) ← |{d_v[1], . . . , d_v[q_r]} ∩ S_r| / q_r
Step 12: end for
Step 13: validity ← (Σ_{i=1}^{n} validity(d_i)) / n
Step 14: return validity

For each document d_i, corresponding to row a_i of A, we select the q_r documents closest to d_i. The value of q_r is
chosen relative to the size of the class r that contains d_i, as a percentage of that size, where percentage ∈ (0, 1].
Then, the validity with respect to d_i is calculated as the fraction of these q_r documents that have the same class
label as d_i, as in line 11. The final validity is determined by averaging over all the rows of A, as in line 13. It
is clear that the validity score is bounded between 0 and 1. The higher the validity score of a similarity measure, the
more suitable it should be for the clustering process.
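The validity computation can be sketched in Python along the same lines. The function name `get_validity`, its
argument order, and the toy matrices are assumptions made for this example. As in Procedure 2, every column of a row
is ranked, including the entry for the document itself, and ties keep their original column order because Python's
sort is stable.

```python
import math


def get_validity(A, labels, percentage):
    """A: n x n similarity matrix; labels: class of each document; percentage in (0, 1]."""
    n = len(A)
    class_size = {}
    for lab in labels:
        class_size[lab] = class_size.get(lab, 0) + 1
    # Steps 1-6: number of neighbours inspected per class, at least one.
    q = {r: max(1, math.floor(percentage * n_r)) for r, n_r in class_size.items()}
    total = 0.0
    for i in range(n):
        # Steps 8-9: column indices ordered by decreasing similarity to d_i.
        order = sorted(range(n), key=lambda j: A[i][j], reverse=True)
        r = labels[i]
        # Step 11: fraction of the q_r closest documents that share d_i's class.
        same_class = sum(1 for j in order[:q[r]] if labels[j] == r)
        total += same_class / q[r]
    # Step 13: average over all rows.
    return total / n


labels = [0, 0, 1, 1]
separated = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
inverted = [[1 - x for x in row] for row in separated]
print(get_validity(separated, labels, 1.0), get_validity(inverted, labels, 1.0))
```

A matrix that ranks same-class documents first scores at the top of the range, and one that ranks them last scores at
the bottom, which matches the bounds stated above.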

IV. Conclusion

Clustering is one of the data mining and text mining techniques used to analyze a dataset by dividing it into
meaningful groups. The objects in a given dataset can have certain relationships among them, and all clustering
algorithms assume this before they are applied. The existing algorithms for text mining use a single viewpoint for
measuring similarity between objects. Their drawback is that the clusters cannot exhibit the complete set of
relationships among objects. To overcome this drawback, we propose a new similarity measure, the multi-viewpoint based
similarity measure, so that the clusters reflect the relationships among objects more completely. This approach uses
different viewpoints, taken from objects in multiple clusters, so that a more informative assessment of similarity can
be achieved.